Managing multiple data-frames

Presented to EdinbR R-Users Group, 2018-07-18

Russ Hyde, University of Glasgow

2018-07-18

Background and Links:

@POG_LRC / @GUcancersci / @bloodwise_uk
- I’m a postdoc bioinformatician at The Paul O’Gorman (POG) Leukaemia Research Centre (University of Glasgow)
- … working for Prof. Mhairi Copland (POG) and Dr. David Vetrie (Wolfson-Wohl Cancer Research Centre)
- … on a Bloodwise-funded grant
- … into chronic myeloid leukaemia
@haematobot
- Personal mumblings about code / analysis / bioinformatics and seemingly very little else …
https://biolearnr.blogspot.com/
- Even more mumblings

Preamble

See https://github.com/russHyde/polyply

# Dependencies:
# - purrr, methods, rlang, tidygraph, dplyr 
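# Install polyply from GitHub if it is not already available
if (!"polyply" %in% rownames(installed.packages())) {
  # devtools is only needed for the GitHub install
  devtools::install_github(
    repo = "russHyde/polyply", dependencies = FALSE
  )
}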

suppressPackageStartupMessages(
  library(polyply)
)

Data-Modelling

Tidy Data and the Normal-Forms

In tidy data¹:

TD1 - Each variable forms a column.
TD2 - Each observation forms a row.
TD3 - Each type of observational unit forms a table.
[TD4 - A key permitting table-joins is present]

See also, Boyce-Codd Normal-Forms² and relational-database-design.

?? TD5 - A tidy way of encapsulating your nicely decomposed tables
?? TD6 - An explicit workflow for combining your tables back together

¹: http://vita.had.co.nz/papers/tidy-data.html
²: https://en.wikipedia.org/wiki/First_normal_form
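
As a small illustration of TD3 and TD4, the sketch below splits a table that mixes two observational units (samples and measurements) into two tables linked by a key, then joins them back together. The table, column names and values are made up for illustration.

```r
library(dplyr)

# A denormalised table: the sample-level fact (tissue) repeats on every row
measurements <- data.frame(
  sample_id  = c("s1", "s1", "s2", "s2"),
  tissue     = c("blood", "blood", "marrow", "marrow"),
  feature_id = c("g1", "g2", "g1", "g2"),
  value      = c(1.5, 2.0, 0.3, 4.1),
  stringsAsFactors = FALSE
)

# TD3: one table per type of observational unit
samples <- distinct(measurements, sample_id, tissue)
values  <- select(measurements, sample_id, feature_id, value)

# TD4: the shared key `sample_id` lets us rebuild the original table
rejoined <- inner_join(values, samples, by = "sample_id")
```

The decomposed form stores each tissue label once per sample; the price is the explicit join whenever both tables are needed together.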

Common Untidy Data Structures

Tidy-data / normal-forms in R:

$\downarrow$ duplication
play nicely with some important things (ggplot2 etc)

But untidy data-structures are useful if they:

$\uparrow$ access efficiency
$\downarrow$ code complexity
play nicely with other important things

`Biobase::ExpressionSet`

[Figure: Biobase::ExpressionSet(), made with DiagrammeR]

`Biobase::ExpressionSet` (cont.)

Conversion of the assayData to meet tidy-data standards:

m # our assayData

##       sample1 sample2 sample3
## gene1    12.2   111.0   129.0
## gene2    19.1    10.5   123.0
## gene3     0.5     3.4     1.1

Doesn’t meet tidy-data standards:

rows correspond to features, columns to samples
not all variables are in columns (since row-IDs are meaningful)
entries are the same ‘type’ of variable

Easy fix³:

m2 <- reshape2::melt(
    m,
    varnames = c("feature_id", "sample_id"),
    as.is = TRUE
  )

head(m2, 4)

##   feature_id sample_id value
##   <chr>      <chr>     <dbl>
## 1 gene1      sample1    12.2
## 2 gene2      sample1    19.1
## 3 gene3      sample1     0.5
## 4 gene1      sample2   111.0

³: ... or as.data.frame / rownames_to_column / gather
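
The footnoted alternative might look roughly like this. It is a sketch that rebuilds the small matrix `m` from the values printed above; the name `m3` is only for illustration.

```r
library(magrittr)  # for %>%

# The example assayData, rebuilt from the printed values
m <- matrix(
  c(12.2, 19.1, 0.5, 111.0, 10.5, 3.4, 129.0, 123.0, 1.1),
  nrow = 3,
  dimnames = list(
    c("gene1", "gene2", "gene3"),
    c("sample1", "sample2", "sample3")
  )
)

m3 <- m %>%
  as.data.frame() %>%
  # move the meaningful row-IDs into a proper column
  tibble::rownames_to_column(var = "feature_id") %>%
  # stack the sample columns into (sample_id, value) pairs
  tidyr::gather(key = "sample_id", value = "value", -feature_id)

head(m3, 4)
```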

But …

Matrix representation was more dense
Lost all encapsulation
(After modifying featureData / phenoData to match)
- Have to join rather than index
- Have to keep track of multiple data-frames, rather than one data-structure
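
To make the "join rather than index" point concrete, here is a toy comparison. The `pheno` table and its `group` column are hypothetical annotations, not part of the example above.

```r
library(dplyr)

# Matrix form: look up one value by indexing
m <- matrix(
  c(12.2, 19.1, 0.5, 111.0, 10.5, 3.4),
  nrow = 3,
  dimnames = list(c("gene1", "gene2", "gene3"), c("sample1", "sample2"))
)
by_index <- m["gene2", "sample2"]

# Tidy form: the same lookup becomes a filter on the long table ...
long <- data.frame(
  feature_id = rep(rownames(m), times = ncol(m)),
  sample_id  = rep(colnames(m), each = nrow(m)),
  value      = as.vector(m),
  stringsAsFactors = FALSE
)
by_filter <- filter(long, feature_id == "gene2", sample_id == "sample2")$value

# ... and attaching sample annotations needs a join on the shared key
pheno <- data.frame(
  sample_id = c("sample1", "sample2"),
  group     = c("control", "treated"),  # hypothetical annotation
  stringsAsFactors = FALSE
)
annotated <- inner_join(long, pheno, by = "sample_id")
```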

That multi-data-frame thing

For a reasonably complex project:

tidy-data / normal-forms mean more data-frames

Wanted:

a lightweight approach to working with multiple ‘conceptually-related’ data-frames
that plays nicely with tidyverse verbs
that feeds into ggplot2
that plays nicely with untidy data-structures I use all the time

`tidygraph` already (sort of) does this

Graph theory

Basics of ‘graph theory’ speak

A graph is made up of two sets:

V, a set of vertices:
- aka nodes, actors, …
E, a set of edges:
- pairwise relationships between vertices
- aka interactions, lines, arcs, …
Need to store attributes for both nodes and edges
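
In data-frame terms, V and E are just two tables. A minimal sketch using `tidygraph::tbl_graph()`; the node and edge attributes (`kind`, `weight`) are invented for illustration.

```r
library(tidygraph)

# V: one row per vertex, with vertex attributes as columns
nodes <- data.frame(
  name = c("a", "b", "c"),
  kind = c("gene", "gene", "protein"),
  stringsAsFactors = FALSE
)

# E: one row per edge; `from` / `to` refer to rows of `nodes`
edges <- data.frame(
  from   = c(1, 1, 2),
  to     = c(2, 3, 3),
  weight = c(0.5, 1.0, 2.0)
)

g <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE)
g
```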

`tbl_graph` data structure

tidygraph is really a wrapper around the package igraph

data("Koenigsberg", package = "igraphdata")
tg <- tidygraph::as_tbl_graph(Koenigsberg)

# Nodes data shows up first:
tg

## # A tbl_graph: 4 nodes and 7 edges
## #
## # An undirected multigraph with 1 component
## #
## # Node Data: 4 x 2 (active)
##   name                Euler_letter
##   <chr>               <chr>       
## 1 Altstadt-Loebenicht B           
## 2 Kneiphof            A           
## 3 Vorstadt-Haberberg  C           
## 4 Lomse               D           
## #
## # Edge Data: 7 x 4
##    from    to Euler_letter name           
##   <int> <int> <chr>        <chr>          
## 1     1     2 a            Kraemer Bruecke
## 2     1     2 b            Schmiedebruecke
## 3     1     4 f            Holzbruecke    
## # ... with 4 more rows

`tbl_graph` data structure (cont.)

# If we make the 'edges' active, the edge-data shows up first:
activate(tg, edges)

## # A tbl_graph: 4 nodes and 7 edges
## #
## # An undirected multigraph with 1 component
## #
## # Edge Data: 7 x 4 (active)
##    from    to Euler_letter name           
##   <int> <int> <chr>        <chr>          
## 1     1     2 a            Kraemer Bruecke
## 2     1     2 b            Schmiedebruecke
## 3     1     4 f            Holzbruecke    
## 4     2     4 e            Honigbruecke   
## 5     3     4 g            Hohe Bruecke   
## 6     2     3 c            Gruene Bruecke 
## # ... with 1 more row
## #
## # Node Data: 4 x 2
##   name                Euler_letter
##   <chr>               <chr>       
## 1 Altstadt-Loebenicht B           
## 2 Kneiphof            A           
## 3 Vorstadt-Haberberg  C           
## # ... with 1 more row

The `activate` verb

Think of the tbl_graph as list[nodes, edges]

To modify the contents of a given data-frame, activate it:

tg %>%
  activate(edges) %>%
  mutate(weight = nchar(name))

## # A tbl_graph: 4 nodes and 7 edges
## #
## # An undirected multigraph with 1 component
## #
## # Edge Data: 7 x 5 (active)
##    from    to Euler_letter name            weight
##   <int> <int> <chr>        <chr>            <int>
## 1     1     2 a            Kraemer Bruecke     15
## 2     1     2 b            Schmiedebruecke     15
## 3     1     4 f            Holzbruecke         11
## 4     2     4 e            Honigbruecke        12
## 5     3     4 g            Hohe Bruecke        12
## 6     2     3 c            Gruene Bruecke      14
## # ... with 1 more row
## #
## # Node Data: 4 x 2
##   name                Euler_letter
##   <chr>               <chr>       
## 1 Altstadt-Loebenicht B           
## 2 Kneiphof            A           
## 3 Vorstadt-Haberberg  C           
## # ... with 1 more row
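
The same idea on a small hand-made graph (not the Koenigsberg data): activate a table, modify it, then pull either table back out as an ordinary tibble. The `label` column is made up for the example.

```r
library(dplyr)
library(tidygraph)

g <- tbl_graph(
  nodes = data.frame(name = c("a", "b", "c"), stringsAsFactors = FALSE),
  edges = data.frame(from = c(1, 1, 2), to = c(2, 3, 3)),
  directed = FALSE
)

# Only the active (edge) table gains a column
g2 <- g %>%
  activate(edges) %>%
  mutate(label = paste(from, to, sep = "-"))

# Whichever table is active is the one that as_tibble() returns
edge_tbl <- g2 %>% activate(edges) %>% as_tibble()
node_tbl <- g2 %>% activate(nodes) %>% as_tibble()
```

The node table comes back unchanged, which is the behaviour `polyply` borrows: verbs act on one table at a time while the others ride along.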

`polyply` and multiple, linked data-frames

`polyply`

Aim:

multiple data-frames in one data-structure
- $\rightarrow$ class poly_frame: extends list
- poly_frame: [list[data-frame], merge_fn]
mutation / filtering
merging
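
One way to picture the `[list[data-frame], merge_fn]` idea is a toy S3 class in base R. Everything below (`toy_poly_frame`, `toy_activate`, `toy_filter`, `toy_merge`, the `active` attribute) is hypothetical and is not how `polyply` is actually implemented.

```r
# Constructor: a named list of data-frames, a merge function and the name
# of the "active" data-frame (hypothetical; not polyply's real internals)
toy_poly_frame <- function(dfs, merge_fn = NULL) {
  if (is.null(merge_fn)) {
    # Fold base::merge over the list: an inner join on shared columns
    merge_fn <- function(x) Reduce(function(a, b) merge(a, b), x)
  }
  structure(
    dfs,
    merge_fn = merge_fn,
    active = names(dfs)[1],
    class = c("toy_poly_frame", "list")
  )
}

# Choose which data-frame later verbs should act on
toy_activate <- function(x, name) {
  stopifnot(name %in% names(x))
  attr(x, "active") <- name
  x
}

# Filter rows of the active data-frame only; `keep` maps a data-frame to a
# logical vector with one entry per row
toy_filter <- function(x, keep) {
  active <- attr(x, "active")
  x[[active]] <- x[[active]][keep(x[[active]]), , drop = FALSE]
  x
}

# Combine all data-frames using the stored merge function
toy_merge <- function(x) {
  attr(x, "merge_fn")(unclass(x))
}

pf <- toy_poly_frame(list(
  exprs = data.frame(
    sample_id = c("s1", "s2", "s3"), value = c(1, 2, 3),
    stringsAsFactors = FALSE
  ),
  pheno = data.frame(
    sample_id = c("s1", "s2", "s3"), type = c("AML", "CML", "ALL"),
    stringsAsFactors = FALSE
  )
))
pf <- toy_activate(pf, "pheno")
pf <- toy_filter(pf, function(df) df$type %in% c("AML", "CML"))
merged <- toy_merge(pf)
```

Filtering `pheno` alone is enough to drop sample `s3` from the merged result, because the inner join only keeps keys present in every table.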

Exported functions

as_poly_frame
- convert a data-structure into a poly_frame
activate
- choose a data-frame from within the poly_frame
filter
- modify the contents of the active data-frame
merge
- user-defined data-frame combiner (default: reduce(df_list, inner_join))
others to be added (mutate / select etc)
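
Going by the description above, the default merge amounts to folding an inner join over the list of data-frames. A sketch with three made-up tables chained by different keys (`id`, then `key`); this is an assumption about the behaviour, not code copied from `polyply`.

```r
library(dplyr)

df_list <- list(
  a = data.frame(id = c(1, 2, 3), x = c("p", "q", "r"),
                 stringsAsFactors = FALSE),
  b = data.frame(id = c(2, 3, 4), key = c("k2", "k3", "k4"),
                 stringsAsFactors = FALSE),
  c = data.frame(key = c("k3", "k4"), z = c(TRUE, FALSE),
                 stringsAsFactors = FALSE)
)

# inner_join(inner_join(a, b), c): each step joins on the shared columns
merged <- purrr::reduce(df_list, dplyr::inner_join)
```

Only rows whose keys survive every join remain, so the order of the list matters only where consecutive tables share no column.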

Examples

ExpressionSet Example

data("leukemiasEset", package = "leukemiasEset")
leuk <- leukemiasEset
leuk

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 20172 features, 60 samples 
##   element names: exprs, se.exprs 
## protocolData
##   sampleNames: GSM330151.CEL GSM330153.CEL ... GSM331677.CEL (60
##     total)
##   varLabels: ScanDate
##   varMetadata: labelDescription
## phenoData
##   sampleNames: GSM330151.CEL GSM330153.CEL ... GSM331677.CEL (60
##     total)
##   varLabels: Project Tissue ... Subtype (5 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: genemapperhgu133plus2

Construct a poly-frame from an ExpressionSet

leuk_pf <- list(
  exprs = reshape2::melt(
    exprs(leuk),
    as.is = TRUE,
    varnames = c("feature_id", "sample_id")
  ),
  pheno = tibble::rownames_to_column(
    phenoData(leuk)@data,
    var = "sample_id"
  )
) %>%
  as_poly_frame()

What did we just make?

purrr::map(leuk_pf, head)

## $exprs
##        feature_id     sample_id    value
## 1 ENSG00000000003 GSM330151.CEL 3.386743
## 2 ENSG00000000005 GSM330151.CEL 3.539030
## 3 ENSG00000000419 GSM330151.CEL 9.822758
## 4 ENSG00000000457 GSM330151.CEL 4.747283
## 5 ENSG00000000460 GSM330151.CEL 3.307188
## 6 ENSG00000000938 GSM330151.CEL 8.230721
## 
## $pheno
##       sample_id Project     Tissue LeukemiaType
## 1 GSM330151.CEL   Mile1 BoneMarrow          ALL
## 2 GSM330153.CEL   Mile1 BoneMarrow          ALL
## 3 GSM330154.CEL   Mile1 BoneMarrow          ALL
## 4 GSM330157.CEL   Mile1 BoneMarrow          ALL
## 5 GSM330171.CEL   Mile1 BoneMarrow          ALL
## 6 GSM330174.CEL   Mile1 BoneMarrow          ALL
##           LeukemiaTypeFullName                         Subtype
## 1 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)
## 2 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)
## 3 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)
## 4 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)
## 5 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)
## 6 Acute Lymphoblastic Leukemia c_ALL/Pre_B_ALL without t(9 22)

Filter and plot:

my_plot <- leuk_pf %>%
  # At first, data-frame `exprs` is active
  filter(feature_id %in% c("ENSG00000000003", "ENSG00000000005")) %>%
  # Select a different data-frame for filtering:
  # - you can use non-standard-evaluation in `activate`
  activate(pheno) %>%
  # only look at myeloid leukaemias
  filter(LeukemiaType %in% c("AML", "CML")) %>%
  # default merge: fold an inner-join
  merge() %>%
  ggplot()
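
For comparison, the same filter / filter / merge sequence written with plain dplyr on two separate data-frames. The tables and values are toy stand-ins, not the leukemiasEset data.

```r
library(dplyr)

exprs <- data.frame(
  feature_id = rep(c("g1", "g2", "g3"), times = 3),
  sample_id  = rep(c("s1", "s2", "s3"), each = 3),
  value      = c(3.4, 3.5, 9.8, 4.7, 3.3, 8.2, 5.1, 2.9, 7.7),
  stringsAsFactors = FALSE
)
pheno <- data.frame(
  sample_id    = c("s1", "s2", "s3"),
  LeukemiaType = c("ALL", "AML", "CML"),
  stringsAsFactors = FALSE
)

# Each filter targets one data-frame, tracked by hand as separate variables
exprs_kept <- filter(exprs, feature_id %in% c("g1", "g2"))
pheno_kept <- filter(pheno, LeukemiaType %in% c("AML", "CML"))

# The explicit join plays the role of the default merge()
merged <- inner_join(exprs_kept, pheno_kept, by = "sample_id")
```

The poly-frame pipeline keeps both tables in one object, so the intermediate variables and the explicit `by =` disappear.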

Filter and plot (cont.)

my_plot +
  geom_boxplot(aes(x = LeukemiaType, y = value)) +
  facet_wrap(~ feature_id) +
  ggtitle("These might not be the most interesting genes in the dataset ...")

Taxonomy and brains

data(Animals, package = "MASS")
animals <- Animals %>%
  tibble::rownames_to_column(var = "common_name") %>%
  mutate(
    # Fix the spelling of this row name before joining on common_name
    common_name = stringr::str_replace(
      common_name, "Dipliodocus", "Diplodocus"
    )
  )

common_to_species <- data.frame(
  common_name = c("Mountain beaver", "Cow", "Grey wolf", "Goat", "Guinea pig",
    "Diplodocus", "Asian elephant", "Donkey", "Horse", "Potar monkey", "Cat"
  ),
  species = c("Aplodontia rufa", "Bos taurus", "Canis lupus",
    "Capra hircus",
    "Cavia porcellus", "Diplodocus longus",
    "Elephas maximus", "Equus africanus asinus",
    "Equus ferus caballus", NA, "Felis silvestris"
  )
)

Taxonomies (cont.)

taxon_data <- taxize::classification(
  x = common_to_species$species,
  get = "order",
  db = "ncbi"
)
  
taxonomy <- Filter(is.data.frame, taxon_data) %>%
  bind_rows(.id = "species") %>%
  select(-id) %>%
  filter(rank %in% c("order")) %>%
  tidyr::spread(key = rank, value = name)

head(taxonomy)

##   species                order
##   <chr>                  <chr>
## 1 Aplodontia rufa        Rodentia
## 2 Canis lupus            Carnivora
## 3 Cavia porcellus        Rodentia
## 4 Elephas maximus        Proboscidea
## 5 Equus africanus asinus Perissodactyla
## 6 Felis silvestris       Carnivora
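
The reshaping steps can be tried without a network lookup. The list below is a hand-made stand-in that assumes the lookup returns one data-frame (columns `name`, `rank`, `id`) per resolved species and `NA` otherwise; the ids are placeholders.

```r
library(dplyr)

# Toy stand-in for the lookup result (assumed shape, placeholder ids)
taxon_data <- list(
  "Canis lupus" = data.frame(
    name = c("Mammalia", "Carnivora", "Canis lupus"),
    rank = c("class", "order", "species"),
    id   = c("1", "2", "3"),
    stringsAsFactors = FALSE
  ),
  "Cavia porcellus" = data.frame(
    name = c("Mammalia", "Rodentia", "Cavia porcellus"),
    rank = c("class", "order", "species"),
    id   = c("1", "4", "5"),
    stringsAsFactors = FALSE
  ),
  "unresolved" = NA
)

taxonomy <- Filter(is.data.frame, taxon_data) %>%  # drop failed lookups
  bind_rows(.id = "species") %>%                   # list names -> column
  select(-id) %>%
  filter(rank == "order") %>%
  tidyr::spread(key = rank, value = name)          # one `order` column
```

Species that fail the lookup simply vanish here, and an inner join against this table later drops them from the merged data as well.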

Taxonomies & brains (cont.)

as_poly_frame(
  list(animals, common_to_species, taxonomy)
) %>%
  merge() %>%
  ggplot(aes(x = body, y = brain, col = order)) +
  geom_point() +
  xlim(0, NA) + ylim(0, NA)

Thanks
